Mawk Manual
Mawk implements the awk language as defined in Aho, Kernighan and
Weinberger, The AWK Programming Language, Addison-Wesley, 1988, ISBN
0-201-07981-X, hereafter called the AWK book. Chapter 2 serves as a
reference to the language, and the rest of the book's 8 chapters provide
a wide range of examples and applications. This book is required reading
for understanding the versatility of the language.
The 1988 version of the language is sometimes called new awk, as opposed
to the 1977 version (awk or old awk). Virtually every Unix system has
old awk, and somewhere in the documentation there will be an (old) awk
tutorial (probably under support tools). If you use (old) awk, the
transition to new awk is easy. The language has been extended and
ambiguous points clarified, but old awk programs still run under new awk.
This manual assumes you know (old) awk, and hence concentrates on the
new features of awk. "Feature xxx is new" means xxx was added to the
1988 version.
Experienced new awk users should read sections 9 and 12, and skim
sections 7 and 8.
1. Command line
mawk [-Fs] 'program' optional_list_of_files
mawk [-Fs] -f program_file optional_list_of_files
2. Program blocks
Program blocks are of the form:
pattern { action }
pattern can be:
regular_expression
expression
( pattern )
! pattern
pattern || pattern
pattern && pattern
pattern , pattern (range pattern)
BEGIN
END
Range, BEGIN and END patterns cannot be combined to form new patterns.
BEGIN and END patterns require an action; for any other pattern, an
omitted action is implicitly { print }.
NR==2 { print } # prints line 2
NR==2 # also prints line 2
If pattern is omitted then action is always applied.
{ print $NF }
prints the last field of every record.
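The pattern forms above can be tried from a Unix shell. What follows is a minimal sketch, assuming a POSIX shell and an awk on the PATH (substitute mawk where it is installed). The five-word input and the variable names are made up for illustration.

```shell
# Hypothetical input: one word per line.
input='alpha
beta
gamma
delta
epsilon'

# Range pattern with the action omitted: prints from the first line
# matching /beta/ through the next line matching /delta/.
range=$(printf '%s\n' "$input" | awk '/beta/,/delta/')

# Expression pattern combined with a negated regular expression.
combo=$(printf '%s\n' "$input" | awk 'NR > 1 && !/a$/ { print NR, $0 }')

printf '%s\n---\n%s\n' "$range" "$combo"
```

The first command prints beta, gamma and delta; the second prints only "5 epsilon", the one line after the first that does not end in a.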
3. Statement format and loops
Statements are terminated by newlines, semi-colons or both. Groups of
statements are blocked via { ... } as in C. The last statement in a
block doesn't need a terminator. Blank lines have no meaning; an empty
statement is terminated with a semi-colon. Long statements can be
continued with a backslash, \. A statement can be broken without a
backslash after a comma, left brace, &&, ||, do, else, the right
parenthesis of an if, while or for statement, and the right parenthesis
of a function definition.
Loops are for(){}, while(){} and do{}while() as in C.
4. Expression syntax
The expression syntax and grouping of the language is similar to C.
Primary expressions are numeric constants, string constants, variables,
arrays and functions. Complex expressions are composed with the
following operators in order of increasing precedence.
assignment:          =  +=  -=  *=  /=  %=  ^=
conditional:         ?  :
logical or:          ||
logical and:         &&
array membership:    in
matching:            ~  !~
relational:          <  >  <=  >=  ==  !=
concatenation:       (no explicit operator)
add ops:             +  -
mul ops:             *  /  %
unary:               +  -
logical not:         !
exponentiation:      ^
inc and dec:         ++  --  (both post and pre)
field:               $
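A few one-liners make the precedence table concrete. This is only a sketch, assuming a POSIX shell and an awk on the PATH; the variable name prec is arbitrary.

```shell
prec=$(awk 'BEGIN {
    print 1 " " 2 + 3                 # addition binds tighter than concatenation
    print 2 * 3 ^ 2                   # exponentiation binds tighter than *
    r = 1 + 2 < 4 ? "yes" : "no"      # add, then compare, then ?:, then assign
    print r
}')
printf '%s\n' "$prec"
```

The three lines printed are "1 5", "18" and "yes".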
5. Builtin variables.
The following variables are built-in and initialized before program
execution.
ARGC      number of command line arguments
ARGV      array of command line arguments, 0..ARGC-1
FILENAME  name of the current input file
FNR       current record number in the current input file
FS        splits records into fields as a regular expression
NF        number of fields in the current record, i.e., $0
NR        current record number in the total input stream
OFMT      format for printing numbers; initially = "%.6g"
OFS       inserted between fields on output, initially = " "
ORS       terminates each record on output, initially = "\n"
RLENGTH   length of the string matched by the last call to the
          built-in function, match()
RS        input record separator, initially = "\n"
RSTART    index of the string matched by the last call to match()
SUBSEP    used to build multiple array subscripts, initially = "\034"
VERSION   Mawk version, unique to mawk.
ARGC, ARGV, FNR, RLENGTH, RSTART and SUBSEP are new.
The current input record is stored in the field $0. The fields of $0,
determined by splitting with FS, are stored in $1, $2, ..., $NF.
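As an illustration of several of these variables together, here is a small sketch, assuming a POSIX shell and an awk on the PATH. The colon-separated input and the name no_such_file are invented; the file is never opened because the second program has only a BEGIN block.

```shell
# NR, NF and $NF with FS set by -F and OFS set in BEGIN.
vars=$(printf 'root:x:0\nbin:x:1:extra\n' |
    awk -F: 'BEGIN { OFS = "-" } { print NR, NF, $1, $NF }')

# ARGC counts the program name (ARGV[0]) plus the remaining arguments.
args=$(awk 'BEGIN { print ARGC, ARGV[1] }' no_such_file)

printf '%s\n%s\n' "$vars" "$args"
```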
6. Built-in Functions
String functions
index(s,t)
length(s), length
split(s, A, r), split(s, A)
substr(s,i,n) , substr(s,i)
sprintf(format, expr_list)
match(s,r) returns the index where string s matches
regular expression r or 0 if no match. As
a side effect, sets RSTART and RLENGTH.
gsub(r, s, t) Global substitution, every match of regular
expression r in variable t is replaced by s.
The number of matches/replacements is returned.
sub(r, s, t) Like gsub(), except at most one replacement.
Match(), gsub() and sub() are new. If r is an expr it is coerced to
string and then treated as a regular expression. In sub and gsub, t can
be a variable, field or array element, i.e., it must have storage to
hold the modification. Sub(r,s) and gsub(r,s) are the same as
sub(r,s,$0) and gsub(r,s,$0). In the replacement string s, an & is
replaced by the matched piece and a literal & is obtained with \&.
E.g.,
y = x = "abbc"
sub(/b+/, "B&B" , x)
sub(/b+/, "B\&B" , y)
print x, y
outputs: aBbbBc aB&Bc
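The side effects of match() and the return value of gsub() can be seen in a short sketch like the following, which assumes a POSIX shell and an awk on the PATH. The sample strings are arbitrary. The value -1 for RLENGTH after a failed match is standard awk behavior rather than something stated in this manual.

```shell
res=$(awk 'BEGIN {
    s = "foo123bar"
    if (match(s, /[0-9]+/))
        print RSTART, RLENGTH, substr(s, RSTART, RLENGTH)

    t = "banana"
    n = gsub(/an/, "<&>", t)          # & stands for the matched text
    print n, t

    m = match(s, /z/)                 # no match: returns 0
    print m, RSTART, RLENGTH
}')
printf '%s\n' "$res"
```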
Arithmetic functions
atan2(y,x) arctan of y/x between -pi and pi.
cos(x)
exp(x)
int(x) x.dddd -> x.0
log(x)
rand() returns random number , 0 <= r < 1.
sin(x)
sqrt(x)
srand(x) , srand() seeds random number generator, uses clock
if x is omitted.
Output functions
print writes $0 ORS to stdout.
print expr1 , expr2 , ... , exprn
writes expr1 OFS expr2 OFS ... OFS exprn ORS to
stdout.
printf format, expr_list
Acts like the C library function, writing to
stdout. Supported conversions are
%c, %d, %e, %f, %g, %o, %s and %x.
The - flag, width and .prec are supported.
Dynamic widths can be built using string operations.
Output can be redirected:
print[f]  >  file
          >> file
          |  command
File and command are awk expressions that are interpreted as a filename
or a shell command.
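Building a dynamic width with string operations might look like the sketch below. It assumes a POSIX shell and an awk on the PATH; the width w and the values printed are hypothetical.

```shell
out=$(awk 'BEGIN {
    w = 6                                   # hypothetical column width
    fmt = "%-" w "s|%" w ".2f|\n"           # concatenates to "%-6s|%6.2f|\n"
    printf fmt, "ab", 3.14159
    printf "%c%c %d %o %x\n", "A", 66, 42, 8, 255
}')
printf '%s\n' "$out"
```

Because the format is an ordinary string expression, the width can be computed at run time, for example from the longest value seen so far.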
Input functions
getline read $0, update NF, NR and FNR.
getline < file read $0 from file, update NF.
getline var read var from input stream, update NR, FNR.
getline var < file read var from next record of file
command | getline read $0 from piped command, update NF.
command | getline var read var from next record of piped command.
(Old) awk had getline; the redirection facilities are new.
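Two of the redirected forms are sketched below, assuming a Unix system with a POSIX shell, mktemp and an awk on the PATH. The scratch file and its contents are invented for the demonstration. The return values of getline (1 for a record, 0 at end of file, -1 on error) are standard awk behavior, not spelled out in this manual.

```shell
tmp=$(mktemp)                         # scratch file for the demonstration
printf 'one\ntwo\nthree\n' > "$tmp"

# The file name is handed over in ARGV[1]; a BEGIN-only program
# does not read its file arguments as input.
got=$(awk 'BEGIN {
    file = ARGV[1]
    while ((getline line < file) > 0)   # getline var < file
        n++
    print n, line
    "echo a b c" | getline              # command | getline sets $0 and NF
    print NF, $2
}' "$tmp")
rm -f "$tmp"
printf '%s\n' "$got"
```

The parentheses around the getline expression keep the comparison with 0 from being read as part of the file name expression.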
Files or commands are closed with
close(expr)
where expr is the command or file name as a string. Close returns 0 if
expr was in fact an open file or command, and -1 otherwise. Close is
needed if you want to reread a file, rerun a command, keep a large number
of output files from exhausting mawk's resources, or wait for an output
command to finish. Here is an example of the last case:
{ # .... do some processing on each input line
  # send the processed line to sort
  print | "sort > temp_file"
}
END { # reread the sorted input
  close( "sort > temp_file" )   # makes sure sort is finished
  cnt = 0
  # parenthesize getline so that > 0 tests its return value
  while ( (getline tmp < "temp_file") > 0 )  line[++cnt] = tmp
  system( "rm temp_file" )      # cleanup
  # ... process line[1], line[2] ... line[cnt]
}
The system() function executes a command and returns the command's exit
status. Mawk uses the shell named in the environment variable SHELL to
execute system() calls and command pipelines, defaulting to "/bin/sh" if
SHELL is not set.
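Rerunning a command after close(), and picking up an exit status from system(), can be sketched as follows. This assumes a Unix system with a POSIX shell and an awk on the PATH; the commands run are trivial placeholders.

```shell
out=$(awk 'BEGIN {
    cmd = "echo hello"
    cmd | getline a
    close(cmd)                        # lets the same command be run again
    cmd | getline b
    status = system("exit 3")         # exit status of the shell command
    print a, b, status
}')
printf '%s\n' "$out"
```

Without the close(), the second getline would find the first pipe already at end of input and b would stay empty.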
7. String constants
String constants are written as in C.
"This is a string with a newline at the end.\n"
Strings can be continued across a line by escaping (\) the newline. The
following escape sequences are recognized.
\\ \
\" "
\' '
\a alert, ascii 7
\b backspace, ascii 8
\t tab, ascii 9
\n newline, ascii 10
\v vertical tab, ascii 11
\f formfeed, ascii 12
\r carriage return, ascii 13
\ddd 1, 2 or 3 octal digits for ascii ddd
\xhh 1 or 2 hex digits for ascii hh
If you escape any other character, \c, you get \c, i.e., the escape is
ignored. Mawk differs from most awks here; the AWK book says \c is c.
Mawk chooses to be different to make conversion of strings to regular
expressions easier.
8. Regular expressions
Awk notation for regular expressions is in the style of egrep(1). In
awk, regular expressions are enclosed in / ... /. A regular expression
/r/, is a set of strings.
s ~ /r/
is an awk expression that evaluates to 1 if an element of /r/ is a
substring of s and evaluates to 0 otherwise. ~ is called the match
operator and the expression is read "s matches r".
s ~ /^r/ is 1 if some element of r is a prefix of s.
s ~ /r$/ is 1 if some element of r is a suffix of s.
s ~ /^r$/ is 1 if s is an element of r.
Replacing ~ by !~ , the not match operator, reverses the meanings. In
patterns, /r/ and !/r/ are shorthand for $0 ~ /r/ and $0 !~ /r/.
Regular expressions are combined by the following rules.
// stands for the one element set "" (not the empty set).
/c/ for a character c is the one element set "c".
/rs/ is all elements of /r/ concatenated with all
elements of /s/.
/r|s/ is the set union of /r/ and /s/.
/r*/ called the closure of r, is // union /r/ union /rr/ union /rrr/ ...
In words, r repeated zero or more times.
The above operations are sufficient to describe all regular expressions,
but for ease of notation awk defines additional operations and notation.
/r?/ // union /r/. In words r 0 or 1 time.
/r+/ Positive closure of r. In words, r 1 or more times.
(r) Same as r -- allows grouping.
. Stands for any character (for mawk this means
ascii 1 through ascii 255)
[c1c2..cn] A character class same as (c1|c2|...|cn) where
ci's are single characters.
[^c1c2..cn] Complement of the class [c1c2..cn]. For mawk
complement in the ascii character set 1 to 255.
Ranges c1-cn are allowed in character classes. For example,
/[_a-zA-Z][_a-zA-Z0-9]*/
expresses the set of possible identifiers in awk.
The operators have increasing precedence:
|
implicit concatenation
+ * ?
So /a|b+/ means a or (1 or more b's), and /(a|b)+/ means (a or b) one or
more times. The so-called regular expression metacharacters are \ ^ $ .
[ ] | ( ) * + ? . To stand for themselves as characters, they have to be
escaped. (They don't have to be escaped in classes; inside classes the
meta-meaning is off.) The same escape sequences that are recognized in
strings (see above) are recognized in regular expressions, except that
in regular expressions mawk changes the rule for any other \c: it
becomes c.
For example,
/[ \t]*/ is optional space
/^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/
is numbers in the Awk language.
Note, . must be escaped to have
its meaning as decimal point.
For building regular expressions, you can think of ^ and $ as phantom
characters at the front and back of every string. So /(^a|b$|^A.*B$)/
is the set of strings that start with a or end with b or (start with A
and end with B).
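The number expression above can be exercised against a handful of candidate strings. This is a sketch only, assuming a POSIX shell and an awk on the PATH; the test strings are chosen for illustration.

```shell
nums=$(printf '%s\n' 1 -1.5e3 .5 1. 1e . abc |
    awk '{
        ok = ($0 ~ /^[-+]?([0-9]+\.?|\.[0-9])[0-9]*([eE][-+]?[0-9]+)?$/)
        print $0, (ok ? "number" : "not a number")
    }')
printf '%s\n' "$nums"
```

The first four strings are accepted. "1e" fails because an exponent needs at least one digit, and "." fails because a digit is required on one side of the decimal point.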
Dynamic regular expressions are new. You can write
x ~ expr
and expr is interpreted as a regular expression. The result of x ~ y
can vary with the variable y; so
x ~ /a\+b/ and x ~ "a\+b"
are the same, or are they? In mawk, they are; in some other awks they
are not. In the second expression, "a\+b" is scanned twice: once as a
string constant and then again as a regular expression. In mawk the
first scan gives the four character string 'a' '\' '+' 'b' because mawk
treats \+ as \+; the second scan gives a regular expression matched by
the three character string 'a' '+' 'b' because on the second scan \+
becomes +.
If \c becomes c in strings, you need to double escape metacharacters,
i.e., write
x ~ "a\\+b".
Exercise: what happens if you double escape in mawk?
If, in strings, you escape only characters with defined escape sequences
(such as \t or \n) and metacharacters in strings you expect to use as
regular expressions, then mawk's rules are intuitive and simple. See
example/cdecl.awk and example/gdecl.awk for the same program with single
and double escapes; the first is clearer.
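The double-escaped form is the one that behaves the same whether \c becomes c or stays \c in strings. A minimal sketch, assuming a POSIX shell and an awk on the PATH; the variable names are arbitrary.

```shell
dyn=$(awk 'BEGIN {
    re = "a\\+b"                      # the string holds four characters: a \ + b
    print length(re)
    m1 = ("a+b" ~ re);     m2 = ("aab" ~ re)
    print m1, m2                      # dynamic regular expression
    m3 = ("a+b" ~ /a\+b/); m4 = ("aab" ~ /a\+b/)
    print m3, m4                      # regular expression constant
}')
printf '%s\n' "$dyn"
```

Both forms match the literal string "a+b" and reject "aab", so the last two lines of output are identical.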
9. How Mawk splits lines, records and files.
Mawk uses essentially the same algorithm to split lines into pieces
with split(), records into fields on FS, and files into records on RS.
Split( s, A, sep ) splits string s into array A with separator sep as
follows:
Sep is interpreted as a regular expression.
If s = "", there are no pieces and split returns 0.
Otherwise s is split into pieces by the matches with sep
of positive length treated as a separator between pieces,
so the number of pieces is the number of matches + 1.
Matches of the null string do not split.
So sep = "b+" and sep = "b*" split the same although the
latter executes more slowly.
Split(s, A) is the same as split(s, A, FS).
With mawk you can write sep as a regular expression, i.e.,
split(s, A, "b+") and split(s, A, /b+/) are the same.
Sep = " " (a single space) is special. Before the algorithm is
applied, white-space is trimmed from the front and back of s.
Mawk defines white-space as SPACE, TAB, FORMFEED, VERTICAL TAB
or NEWLINE, i.e [ \t\f\v\n]. Usually this means SPACE or TAB
because NEWLINE usually separates records, and the other
characters are rare. The above algorithm
is then applied with sep = "[ \t\f\v\n]+".
If length(sep) = 1, then regular expression metacharacters do
not have to be escaped, i.e. split(s, A, "+") is the same as
split(s, A, /\+/).
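The rules above can be checked case by case. The following is a sketch under the assumption of a POSIX shell and an awk on the PATH; the strings being split are made up, and separators that can match the null string are left out because awks differ on them.

```shell
pieces=$(awk 'BEGIN {
    n = split("", A);                 print n                 # empty string: no pieces
    n = split("a:b:", A, ":");        print n, "[" A[3] "]"   # empty piece after last ":"
    n = split("  a \t b  ", A, " ");  print n, A[1], A[2]     # single space: trim first
    n = split("a+b+c", A, "+");       print n                 # one character: no escape needed
    n = split("abbbc", A, /b+/);      print n, A[1], A[2]     # regular expression separator
}')
printf '%s\n' "$pieces"
```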
Splitting records into fields works exactly the same except the pieces
are loaded into $1, $2 ... $NF.
Records are also the same, RS is treated as a regular expression. But
there is a slight difference, RS is really a record terminator (ORS is
really a terminator also).
E.g., if FS = ":" and $0 = "a:b:" , then
NF = 3 and $1 = "a", $2 = "b" and $3 = "", but
if "a:b:" is the contents of an input file and RS = ":", then
there are two records "a" and "b".
RS = " " does not have special meaning as with FS.
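The separator versus terminator distinction in this example can be reproduced directly. A sketch, assuming a POSIX shell and an awk on the PATH:

```shell
# As a field separator, the trailing ":" leaves an empty third field.
fs=$(printf 'a:b:\n' | awk -F: '{ print NF }')

# As a record terminator, the trailing ":" just ends the second record.
rs=$(printf 'a:b:' | awk 'BEGIN { RS = ":" } { print NR ": " $0 }')

printf '%s\n%s\n' "$fs" "$rs"
```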
Not all versions of (new) awk support RS as a regular expression. This
feature of mawk is useful and improves performance.
BEGIN { RS = "[^a-zA-Z]+"
getline
if ( $0 == "" ) NR = 0
else word[1] = $0
}
{ word[NR] = $0 }
END { ... do something with word[1]...word[NR] }
isolates words in a document over twice as fast as reading one line at a
time and then examining each field with FS = "[^a-zA-Z]+".
To remove comments from C code:
BEGIN { RS = "/\*([^*]|\*[^/])*\*/" # comment is RS
ORS = " "
}
{ print }
END { printf "\n" }
10. Multi-line records
Since mawk interprets RS as a regular expression, multi-line records are
easy. Setting RS = "\n\n+", makes one or more blank lines separate
records. If FS = " " (the default), then single newlines, by the rules
for space above, become space.
For example, if a file is "a b\nc\n\n", RS = "\n\n+" and
FS = " ", then there is one record "a b\nc" with three
fields "a", "b" and "c". Changing FS to "\n" gives two
fields "a b" and "c"; changing FS to "" gives one field
identical to the record.
For compatibility with (old) awk, setting RS = "" has the same
effect on determining records as RS = "\n([ \t]*\n)+".
Most of the time when you change RS for multi-line records, you
will also want to change ORS to "\n\n".
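Multi-line records can be tried with the compatibility setting RS = "", which is used in this sketch because it does not depend on RS being treated as a regular expression. A POSIX shell and an awk on the PATH are assumed; the two-paragraph input is invented.

```shell
text='a b
c


d e'

# Default FS: newlines inside a record count as white-space.
words=$(printf '%s\n' "$text" | awk 'BEGIN { RS = "" } { print NR, NF, $NF }')

# FS = "\n": each line of the record becomes one field.
lines=$(printf '%s\n' "$text" | awk 'BEGIN { RS = ""; FS = "\n" } { print NR, NF, $1 }')

printf '%s\n---\n%s\n' "$words" "$lines"
```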
11. User functions.
User defined functions are new. They can be passed expressions by value
or arrays by reference. Function calls can be nested and support
recursion. The syntax is
function funcname( args ) {
.... body
}
Newlines are ignored after the ')' so the '{' can start on a different
line. Inside the body, you can use a return statement
return expr
return
As in C, there is no distinction between functions and procedures. A
function does not need an explicit return. Extra arguments act as local
variables. For example, csplit(s, A) puts each character of s in array
A.
function csplit(s, A, i)
{
for(i=1; i <= length(s) ; i++)
A[i] = substr(s, i, 1)
}
Putting lots of space between the passed arguments and the local
variables is a convention that can be ignored if you don't like it.
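A complete program using csplit(), together with a recursive function, might look like the sketch below. It assumes a POSIX shell and an awk on the PATH. The function fact() and the return value added to csplit() are illustrative additions, not part of the manual's example.

```shell
fn=$(awk '
function fact(n) { return n <= 1 ? 1 : n * fact(n - 1) }   # recursion

# A is passed by reference; i is an extra argument used as a local.
function csplit(s, A,    i) {
    for (i = 1; i <= length(s); i++)
        A[i] = substr(s, i, 1)
    return length(s)
}

BEGIN {
    i = "untouched"                   # the global i is not changed by csplit
    n = csplit("awk", chars)
    print fact(5), n, chars[1] chars[2] chars[3], i
}')
printf '%s\n' "$fn"
```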
Dynamic regular expressions allow regular expressions to be passed to
user defined functions. The following function gobble() is the lexical
scanner for a recursive descent parser; the whole program is in
examples/cdecl.awk.
function gobble( r,   x)   # eat regular expression
                           # r off the front of global variable line
{
  if ( match( line, "^(" r ")") )
  {
    x = substr(line, 1, RLENGTH)
    line = substr(line, RLENGTH+1)   # keep only the text after the match
  }
  else  x = ""
  return x
}
You can call a function before it is defined, but the function name and
the '(' must not be separated by white space to avoid confusion with
concatenation.
12. Other differences in mawk
The main differences between mawk and other awks have been discussed:
RS is a regular expression, and regular expression metacharacters don't
have to be double escaped. Here are some others:
VERSION -- built-in variable holding version number of mawk.
mawk 'BEGIN{print VERSION}' shows it.
-D -- command line flag causes mawk to dump to stderr
a mawk assembler listing of the current program.
The program is executed by a stack machine internal
to mawk. The op codes are in code.h, the machine in
execute.c.
srand() --
During initialization, mawk seeds the random number generator
by silently calling srand(), so calling srand() yourself is
unnecessary. The main use of srand is to use srand(x) to get
a repeatable stream of random numbers. Srand(x) returns x
and srand() returns the value of the system clock in some form
of ticks.
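Getting a repeatable stream with srand(x) can be sketched as follows, assuming a POSIX shell and an awk on the PATH. The seed 42 is arbitrary, and the return value of srand() is deliberately not used because awks differ on it.

```shell
rnd=$(awk 'BEGIN {
    srand(42); a = rand() " " rand()
    srand(42); b = rand() " " rand()  # same seed, same stream
    same = (a == b)
    r = rand()
    inrange = (r >= 0 && r < 1)
    print same, inrange
}')
printf '%s\n' "$rnd"
```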
13. MsDOS
For a number of reasons, entering a mawk program on the command line
using command.com as your shell is an exercise in futility, so under
MsDOS the command syntax is
mawk [-Fs] optional_list_of_files
You'll get a prompt, and then type in the program. The -f option works
as before.
If you use a DOS shell that gives you a Unix style command line, to use
it you'll need to provide a C function reargv() that retrieves argc and
argv[] from your shell. The details are in msdos/INSTALL.
Some features are missing from the DOS version of mawk: No system(), and
no input or output pipes. To provide a hook to stderr, I've added
errmsg( "string" )
which prints "string\n" to stderr which will be the console and only the
console under command.com. A better solution would be to associate a
file with handle 2, so print and printf would be available. Consider
the errmsg() feature as temporary.
For compatibility with Unix, CR are silently stripped from input and LF
silently become CRLF on output.
WARNING: If you write an infinite loop that does not print to the
screen, then you will have to reboot. For example
x = 1
while( x < 10 ) A[x] = x
x++
By mistake the x++ is outside the loop. What you need to do is type
control break and the keyboard hardware will generate an interrupt and
the operating system will service that interrupt and terminate your
program, but unfortunately MsDOS does not have such a feature.
14. Bugs
Currently mawk cannot handle \0 (NUL) characters in input files;
otherwise mawk is 8 bit clean. Also, "a\0b" doesn't work right: you
get "a". You can't use \0 in regular expressions either.
printf "A string%c more string\n" , 0
does work, but more by luck than design since it doesn't work with
sprintf().
15. Releases
This release is version 0.97. After a reasonable period of time, any
bugs that appear will be fixed, and this release will become version
1.0.
Evidently features have been added to awk by Aho, Kernighan and
Weinberger since the 1988 release of the AWK book. Version 1.1 will add
whatever features are necessary to remain compatible with the language
as defined by its designers.
After that ... ?
16. Correspondence
Send bug reports or other correspondence to
Mike Brennan
brennan@bcsaic.boeing.com
If you have some interesting awk programs, contributions to the examples
directory would be appreciated.